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Summary 

When the approximate size of a population is of primary interest and the cost of an exhaus- 
tive enumeration is prohibitive, capture-recapture models can be used to estimate the size of the 
population using only incomplete lists. Log-linear models that express the probability of each 
capture pattern in terms of the relative sizes of list-intersections tend to be biased by hetero- 
geneity of capture probabilities across population units. We introduce a smooth generalization of 
post-stratification that allows the full suite of log-linear modeling tools to be applied at the unit 
level for closed populations. We illustrate the generality and simplicity of our novel approach by 
estimating bird species richness in continental North America. 

Some key words: Capture-recapture; Continuous Covariate; Post-sttatiflcation; Closed Population; Species Richness. 



1. Introduction 

In many statistical applications, it is important to know the total size of a population when only 
samples are available; that is, to estimate how much of the population was not seen. Capture- 
recapture methods attempt to estimate population size from multiple samples and their patterns 
of overlap, an undertaking which requires careful modeling of the sampling process. Populations 
studied using capture-recapture are diverse, including various animal species (Odum & Pontin, 
1961; Pollock et al., 1984), human populations (Chen et al., 2010), and the set of errors in a 
body of computer code (Runeson & Wohlin, 1998). This paper gives a new approach to the 
underlying statistical problem of estimating the size of a population from multiple incomplete 
lists or samples. 

We review some basic capture-recapture concepts before introducing our new approach. In the 
simplest setting, there are two lists. Assume that units can be perfectly matched across lists, so it 
is possible to cross-classify units by list membership as in Table 1 . Here Cij is the count of units 

Table I . A two-list cross-classification array 

List 2 
yes no 

List I 

no coi Coo = ' 

with capture pattern ij. For example, cio is the number of units on List 1 but not on List 2. The 
unknown number of units that are not observed on either list is denoted cqo, and estimating the 
population size amounts to estimating cqq. With three lists, the task is to estimate cooo> and so on. 
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The Petersen estimator is cqo = ciocqi /ch , which can be formalized as a maximum-likelihood 
estimator under certain assumptions (Pollock, 1976). Perhaps the strongest of these assumptions 
is that the lists are independent; the event that a unit is captured on the first list is independent of 
the event that a unit is captured on the second list. Dependence between lists has two sources. The 
first source is unit-level list dependence, such as respondent fatigue, in which previous capture 
directly reduces the probability of subsequent capture. The second source of dependence arises 
indirectly as a consequence of heterogeneity, or variability in capture probabilities across units 
(Fienberg et al., 1999). In particular, when the capture probabilities are positively (negatively) 
correlated across Usts, the Petersen estimator tends to be biased downwards (upwards). 

Both sources of dependence may vary with covariates such as age. Post-stratification is an old 
method of accounting for covariates by partitioning the observed units into a finite collection of 
post-strata. For example, in a human population, observed individuals may be classified by age as 
younger than 30, between 30 and 60, or older than 60. Based on these post-strata, three separate 
classification arrays (as in Table 1) may be analyzed separately, resulting in estimates for the 
three missing cells, which can be summed to get an estimate of the total number of unobserved 
units. Post-stratification allows models that rely on homogeneity to perform reasonably well if 
each of the chosen post-strata is relatively homogeneous. 

Because post-stratification has obvious disadvantages, including the loss of information in the 
discretization of continuous covariates, several modem models allow both unit-level dependence 
and heterogeneity to smoothly depend on covariates. Huggins (1989) and Alho (1990) developed 
logistic regression models for capture probabihty heterogeneity in the two-hst scenario, and Yip 
et al. (2001) extended logistic regression to k lists with a simple respondent fatigue effect. A 
rather general semi-parametric regression model was developed by Hwang & Huggins (2011). 
Several other approaches are closely related: Chen & Lloyd (2002), Zwane & van der Heijden 
(2004), and Stoklosa & Huggins (2012). 

We present a smooth generalization of post-stratification with two stages. The first stage is 
estimating the conditional probability of each capture pattern as a function of the covariates. 
The second stage is imputing the conditional probability of the unobserved capture pattern (no 
captures) by fitting a separate log-linear model for each observed population unit. The log-Unear 
models imply an estimate of the number of unobserved units corresponding to each observed 
unit. Although some existing approaches mentioned above are similar, our framework is espe- 
cially general, admitting both parametric and nonparametric estimation of the conditional prob- 
ability of the capture pattern. More importantly, we allow log-Unear model selection to vary 
over the covariate space. Throughout, we assume that the population is closed, precluding the 
possibility of births, deaths, and migration. 

2. Methods 
2- 1 . Notation and general framework 
Suppose k lists Li, are drawn from a population of unknown size n. Let i = 1, nc 
index the units that are on at least one list. For each unit i and list Lj, let y.y = I(i £ Lj) be the 
indicator that the ith population unit appears on the jth list. Then yi. = {yn, ...jyik) , and y.. is 
the n X A; matrix with ith row yi. . The vector y^. is called the capture pattern of the iih unit. Let 
Xj. denote a. I x q vector of covariates associated with the 2th unit, and x.. is the n x q matrix 
with ith row Xi. . For each i > ric, the pair (xj. , yi. ) is not observed. If xf; is the matrix formed by 
the first ric rows of x.., and y?. is the matrix formed by the first Uc rows of y.., then the observable 
data is the pair of matrices {x':.,y^.). We refer to the pair {x..,y..) as the extended data. 
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Let yk denote the set of binary row vectors of length k, so each j/j. is an element of y^. Let 
Cy := \{i : Ui. = y}\. The array c := {cy : y ^ y^} is the contingency table of counts of units 
in the hsts. In particular, let cq := = n — ric, the number of units that are not observed on 
any list. Assume that yi. is a realization of a random vector Yi.. Let p{i, y) = pr{Yi. = y), the 
probability that unit i has capture pattern y. Then p{i, yi.) = pr{Yi. = yi.). 

We state a key assumption that is necessary for auxiliary-covariate models of heterogeneity 
that has often been left unstated in the literature. Namely, we assume that a smooth function 
r{y,x) exists such that p{i,yi.) = r{yi.,Xi.) {i = 1, ...,n). This is a rather strong assumption, 
requiring that the covariates x fully explain any variation in the capture probabilities. 

Let 0^ denote the zero vector of length k. Define the detection function 'tp{x) = I — r(0, x), 
which is the probability that a unit with covariates x appears in at least one of the lists. The 
Horvitz-Thompson estimator of the population size n can be written as 



n 



The estimator n relies on the detection probabilities for only the units that are observed. To use 
(1), we must estimate the detection function V'. If ijj is known, then n has some nice asymp- 
totic properties. It is easy to verify that E{h) = n. Moreover, h is consistent and asymptotically 
normal if ip{xi.) is uniformly bounded away from and 1 (Alho, 1990). 

The Horvitz-Thompson estimator (1) has been applied using various estimators for the detec- 
tion function ^. We propose a general way of framing these estimators. Define a function 

/ N r{y,x) r{y,x) 
'^[y,x):=^ T r = , . . • (2) 

For each nonzero y G yk, the function 7r(y, x) is the conditional probabihty that a unit with 
covariates x will have capture pattern y, given that the unit is observed on at least one list. 
Conditioning on observation means that these functions are estimable directly from the data 
using any kind of binary regression. Holding y fixed, consider 7r(?/, x) as a function of x, and 

gather these functions into an array 11 := {7r(y, x) : y / 0}. 
Rearranging (2) as r{y, x) = 'ip{x)Tr{y, x), it is easy to see that 

= = ^y^S^y^"^^ ^ 1 (3) 

Eyr{y,x) vr(0,x) + Ej/^o^(y'^) ^(0,x) + l 

Equation (3) makes it seem natural to break the process of estimating t/j into two stages. The 
first stage is to generate estimates n := {7r(y, x) : y ^ 0}, while the second stage is to impute 
an estimator 7r(0, x) from 11. We deal with these two in sections 3 and 4 respectively. 

Plugging the final expression from (3) into (1) leads to estimators involving the sum of the 
unit-level imputations: 

ric ric 

n := Uc + ^7r(0, Xj.), co := n - ric = ^7r(0,a;i.). (4) 
i=i 1=1 

Any estimator 11 is a post-stratifier, and any smooth estimator 11 is a smooth post-stratifier. 
Finally, a smooth post-stratification estimator takes the form of (4), where 7r(0, x) is imputed 
from a smooth estimator 11 (see Section 4). 

Although it has not been stated in such an explicit and general form, smooth post-stratification 
is not new. An early version was derived under the title of local post-stratification (Chen & 
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110 Lloyd, 2002). We advocate the term smooth rather than local because post-stratification is already 
(i.e., redundantly) an attempt at localization; the relevant feature of the new framework is the 
assumption of a smoothness condition that ties post-strata together. 

3. Stage 1: Estimating 11 

There are many ways to estimate the functions in 11, including local logistic multinomial 
115 regression and other parametric regression models. We use a nonparametric conditional den- 
sity estimator by Hall et al. (2004). Let rrij = 1 if the ith unit is captured at least once, and 
rrii = otherwise. Assume that each nii is the outcome of a Bernoulli variable Mj. From (2), 
we have 7r(y, Xi ) = P{Yi. = y\Mi = 1, Xj.) for each y / 0. Suppose that each vector Xi. is a 
realization of some random variable X. Let fmixi.) := P{X = Xj.|Mj = 1) and gM{y,x) := 
120 Tr{y,x)f Mix). Then, 

9Miyi-,Xi.) = P{Yi. = yi.\X = Xi.,Mi = 1)P{X = Xi.\Mi = 1) 
= P{yi.,Xi.\Mi = 1). 

Each of qm and /m can be estimated directly from the observable data (i.e., units with rrii = 1), 
and the conditional density of any nonzero capture pattern y given X = xis 

A nonparametric estimator of gM uses smoothing parameters for both x and for the multino- 
mial outcome y. However, we set the y bandwidth to zero (no smoothing), since Stage 2 of our 
125 two-stage process smooths over y in a more comprehensive way (see Section 4). With no smooth- 
ing over y, we are interested in only the normalized vectors of weights w'^ = {w\, • • • , wl^J such 
that the fitted values for the functions in n are 

ric 

Tr{y,Xi.) = ^wll{y = yt.) {i = 1, ...,nc; y ^ 0). (5) 

t=i 

For example, if ft is a Gaussian smoother, and P{-,D) is the multivariate Gaus- 
sian density centered at Xj. with a covariance matrix D of bandwidth parameters, then 

130 wl = p{xt.,D)/J2tf{xt.,D){i = l,...,nc;t = l,...,nc)j^ 

The array of estimated local cross-classification rates is Hi := {Tt{y, Xi.) : y / 0}. Let a{y) 
denote the array of indicators {I{z = y) : z e yk}. By (5), ftj = Ylt=i '^t^ivt-)^ a weighted- 
average array which gives greatest weight to observations with covariates close to Xj. . For addi- 
tional intuition about this notation, compare Hj with the cross-classification c = Yll=i (^iUi-)- I^i 

135 particular, if all the weights were 1, then each Hi would be the same as the observable part of c. 

4. Stage 2: Imputing 7r(0, x) 
At a high level. Stage 2 involves fitting a log-hnear model Mi to the local cross-classification 
rtj, and then using the fitted model to project a value for the missing cell 7r(0,a;i.) {i = l,...,nc). 
Our framework permits maximal generality by allowing a separate model to be selected for every 
140 distinct point that is observed in the covariate space. 

Stage 2 builds on the result of Stage 1, taking as given the vectors of smoothing weights 
w'' {i = 1, ric). Hence, a mixture of multinomials J2t=i ^l^(^ ) ^^^h conditional probabih- 
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tiespr{a{Yi.) = a{yi.)\Mi = 1} = 7r(yi., Xj.) is assumed to generate lij {i = 1, nc). We now 
discuss the details of selecting and fitting a log-linear model for Xlj. 

4- 1 . Short review of log-linear models 
Log-linear models provide flexible ways to represent the cross-classification c as the out- 
come of a multinomial random variable. Classical log-linear models formally assume homo- 
geneity, which says that p{ii, y) = p(i2, u) ='■ p{y) for all ii,i2 G {1, n}. Independence be- 
tween units is also assumed, so that the array of counts c is a realization of a multinomial ran- 
dom variable with n trials from the probability array {p{y)}yeyk - Given a vector of parameters 
u = (no, ui,U2, uu), a simple log-linear model is 

logp(y; n) = no -I- niyi -|- U2y2 + Ui2yiy2 {y e yk), (6) 

where yj denotes the jth element of the vector y {j = 1, ...,k). The parameters ui,U2 are called 
list effects, and ui2 represents the interaction between the first and second list. If there are more 
than two lists, additional parameters may be included to describe the other list interactions. For 
example, with k Usts, the highest-order list interaction is denoted ni...^. 

Parameter estimates can be found by maximizing the multinomial conditional likelihood 

Le(n|c\co)= . YlTTiy-uyy, (7) 



where 7r(y; u) := p{y; n)/(l — p{6; u)) (Sanathanan, 1972; Fienberg, 1972). Given a maximum 
likeUhood estimate p(0; n), the marginal likelihood 

nc!(n - nc)! 

is maximized over n to obtain a population estimate n. 

The cross-classification c has only 2*^ — 1 observable cells, and a unique maximizer of (7) 
exists for a model with at most 2^^ — 1 parameters. Thus, a model with exactly 2^^ — 1 parameters 
is called saturated, providing a perfect fit for the observed relative frequencies in the cells of c. 
With only two lists, one may take 77,12 := in (6) to get a saturated hierarchical log-linear model, 
and maximizing the conditional and marginal likelihoods gives the Petersen estimator for the 
missing cell, cqo := n - = ciqcoi/ch. 

4-2. Local log-linear models 
Recalling (2), the conditional multinomial probabihties may vary as a function of some vector 
of unit-level covariates such as age or size. The Petersen estimator can be modified trivially to 
get the following unsurprising result: If 7r(y, x) is constant in x, then the conditional maximum 
likelihood estimate of 7r(0, x) = 7r{(0, 0), x} is 

■= vr{(l,l),x} ' 

where 7r(y, x) := Cy/uc {y / 0) is the direct estimate for each observable capture pattern. 

When 7r(y, x) is not constant in x, (8) may still give a consistent estimator. It is easiest to see 
this when x can take on only S different values. Then the data can be partitioned into S classes, or 
post-strata, resulting in S separate cross-classification arrays {c(s)}f^i such that c{s) = c. 
The saturated log-Unear model may be fitted separately on each post-stratum; and the estimate 
of the missing cell is then of the form cqq = J2s ^0 (s) = J2s ^01 (s)coi (s) /cn (s) . In particular. 
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175 (8) clearly applies if we take TTs{y, x) := Cy{sx)/nc{sx) where Sx denotes the index of the post- 
stratum containing x, and nc(s) is the number of observed units in post-stratum s. 

The variability of vr^ tends to increase as the number S of post-strata grows, since fewer 
observations are in each post-stratum. We can control the variance by imposing a smoothness 
assumption across post-strata. For example, can be replaced with a kernel regression or para- 

180 metric regression estimator vr. Assuming that 7r(y, x) is sufficiently smooth to allow consistency 
of an estimator Tr{y,x), it follows that 7r(0, x) as in (8) is consistent. Hence, we localize the 
u-terms by allowing them to take a different value for each covariate vector x, and (6) becomes 

logTr{y;u(x)} = uo{x) + ui{x)yi + U2{x)y2 (y G 3^2) 

for two lists. Similarly, for three lists, a saturated local log-linear model is 

log7r{y]u(x)} =uoix) + ui{x)yi + U2{x)y2 + U3{x)y3 

+ ui2{x)yiy2 + ui^{x)yiy^ + U2^{x)y2yz {y G 3^3)- 

Various models are obtained as submodels of (9) by removing terms. For example, the inde- 
185 pendence model for three lists leaves out the interaction terms to encode the assumption that the 
probability of capture on each list is independent of the event of capture on any other list: 

log7r{y;u(x)} =uo{x) + ui{x)yi + U2ix)y2 + U2,{x)yz. (10) 

4-3. Estimating local log-linear parameters 
The likelihood of local log-Unear model parameters u{x) with respect to the smoothed data 
Hi, ...,nne is 



u{x) \ . 



\{prl Y,w\a{Yt.) = Ili 
i=i [ t=i 

This product is maximized when each individual term is maximized. Regarding each term of the 
product as a local conditional likelihood given the data Hi, let 

t=i 



u{xi.) > {i = l, ...,nc). 



In the special case for which the weights come from a boxcar kernel and the capture pattern 
is homogeneous over the support of the kernel, Lj can be written in a form similar to (7) and 
analyzed by standard methods. Otherwise, exact evaluation of Lj is not straightforward. We 
approximate Lj by pretending that the mixture Ilj is in fact multinomial, as follows. 

For each i G {1, n,} and nonzero y G y^, pick a positive number Aj and define 7r*(y, Xj.) 
to be the unique value such that AjTr* {y, Xj.) is equal to Aj7r(?/, Xj.) rounded to the nearest integer. 
We pick Aj to be large, say Aj = 10^, so that the difference {Tt*{y, Xj.) — 7r(y, Xj.)} is negligi- 
ble. The result is an integer- valued cross-classification array II* = [Ai7r*(y, Xj.) : y 7^ 0] with 
essentially the same proportionality structure as Hi, so that U* « Ajllj. Since the element-wise 
differences (IT* — Ajllj) are negligible, any log-linear model for 11* must fit 11 j equally well in 
terms of any scale-free measure of goodness of fit. Therefore, we fit a log-linear model to IIT 
by assuming that 11* is a draw from a multinomial distribution with Aj trials, with the likelihood 
function 
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Parameter estimates u{xi.) are obtained by maximizing (11) subject to the constraint that the 
multinomial probabilities sum to 1, resulting in the estimate 7r(0, Xj.) := 7r{0, u{xi.)} of the un- 
known 7r(0, Xi.). We refer to this method as pseudo-multinomial maximum likelihood estimation. 

Estimation by pseudo-multinomial maximum likelihood is essentially a sliding-window ver- 
sion of post-stratification if the weights w'^ correspond to a boxcar-like kernel. The tradi- 205 
tional method, fitting log-linear models to disjoint post-strata, and our new approach, pseudo- 
multinomial maximum likelihood on overlapping kernels, may both incur bias if 7r{y, x) is par- 
ticularly variable with respect to x over the support of any given post-stratum or kernel evaluated 
at a fixed Xj.. Also, population estimates have high variance if the post-strata are too small or if 
the smooth post-stratifier 11 undersmooths. 210 

The fact that parameter estimation is done separately for each population unit allows us select 
a separate model at each x. A more rigorous notation for u{x) in (11) could be something like 
u{M-{x)} to reflect that u is the parameter vector corresponding to whichever model Ai is 
selected at x. Specifically, for distinct points in the covariate space xi and X2, parameter vectors 
u{xi) and u{x2) may differ in length. This expanded notation is omitted for brevity. 215 

44. Local Model Selection 

Several old strategies for log-linear model selection exist (Benedetti & Brown, 1978). Fienberg 
(1972) recommended a nuanced and somewhat ad hoc method centered around likelihood ratio 
tests, and stepwise regression based on an information criterion has seen recent use in capture- 
recapture applications (Aaron et al., 2003; Murphy, 2009). Each of these approaches can be 220 
apphed for the local log-hnear model selection problem with appropriate modifications. 

We choose to implement the Schwarz information criterion due to its simplicity and conve- 
nience. With a minor modification, this criterion is applicable to the local model selection prob- 
lem in conjunction with pseudo-multinomial maximum likelihood estimation. The reason for a 
modification is that the apparent number of degrees of freedom in a log-linear fit of IIT has no 225 
connection to reality after the scaling by the arbitrarily chosen Aj. Specifically, a large inflation 
factor Xi implies a large total cell count in 11*, tending to cause overfitting. 

We elaborate on this point. Using the pseudo-multinomial likelihood (11), the standard 
Schwarz information criterion objective function to be minimized is — 21ogL* + glog(Ai), 
where q is the number of log-hnear parameters. Increasing A, means that there are more terms in 230 
the product L*, so —2 log Lj grows at least linearly in A^, while q log(Aj) grows only log-linearly. 
Therefore, when Aj is large, —2 log L| tends to be more important than q log(Aj), leading to the 
selection of a log-linear model with too many parameters. 

Ideally, the number of terms in the pseudo-multinomial likelihood to be used for the infor- 
mation criterion should reflect the number of observed population units underlying the locally 235 
averaged multinomial frequencies lij. A reasonable proxy for this quantity, which we will call 
the local degrees of freedom, is the sum of the conditional regression weights for the ith unit: 
rii := J2^=i "^t- Since the sum of the elements of the array (nj/Aj)n* is rij, we replace the stan- 
dard Schwarz information criterion objective function with — 2(nj/Ai) log(L|) -|-^log(ni). A 
similar modification could be used with the Akaike information criterion. 240 



5. Variance estimation 

The multi-stage structure of our method makes it difficult to compute the variance directly. 
Instead, we simulate the sampling distribution of our estimator using a method that is similar to 
the parametric bootstrap described by Zwane & van der Heijden (2003). 
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245 Let {x^.^ •) denote the covariates of the ric observed units. The fraction l/'il:{xi) in (1) is the 
number of units that are represented by the ith unit. That is, the contribution of the ith unit in 
the Horvitz-Thompson sum n can be decomposed as the sum of 1, representing the ith unit, and 
Oj := — 1 additional units that were not captured. Let o-"* and of'^^ denote the integer 

and fractional components of Oj, respectively, such that Oj = o^"* + of^'^, where all quantities are 

250 non-negative. Let o[ denote the sum of o^"* with the outcome of a BemoulU random variable 
with success probability of^'^. 

For each i G {1, nd, we insert o[ new units with the covariate Xj.. With the insertions, the 
set of covariates represents the population that is assumed by our model, and this set is denoted 
as (x. ., .)**™. Replacing Oj with its random perturbation o' introduces an element of randomness 

255 that is not included in our model, and this should slightly inflate the variance of our simulation 
outcomes, leading to conservative confidence intervals. 

Finally, the capture pattern for the ith individual yj. is simulated as a multinomial random 
variable with unit-level multinomial probabilities f(yi.,Xi.) := V'(xj.)7r(yj., .Xj.). The result is 
a simulated population with covariates and capture patterns observed for all units. 

260 Deleting all units with capture pattern gives the simulated data {x':.,y^.y^'^, and applying a 
smooth post-stratification methods gives an estimate Cq*"* of the number of units not observed. 
Replicating this procedure leads to bootstrap confidence intervals. 



6. Several important local log-linear models 

Our adaptation of the Schwarz information criterion for model selection in Section 44 is 
intended as a first-generation solution to the local model selection problem. In this section, we 
take a small step backwards to discuss several specific log-linear models that are interesting due 
to interpretability or historical significance in the capture-recapture literature. 

The independence model (10) can be further constrained by requiring that the list effects are all 
equal to a single parameter, := ui = ■ ■ ■ = Uk. This leads to a local model of independence 
with equal catch-ability across lists, an extremely sparse model with only one free parameter: 

k 

log7T{y;u{x)} =u{x) + uj:{x)'^yj. (12) 

For basic local log-Unear models with no highest-order interaction term (i.e., ui...k{x) := 0) 
such as (9), (10) , and (12), eliminating all of the explicit u(x)-terms from the model equations 
leads to 

7r(0;n(x)) = |{^^^ , (13) 

where O is the set of capture patterns with entries summing to an odd number, and E is the 
set of nonzero capture patterns summing to an even number (Fienberg, 1972). The saturated 
model (9) is particularly convenient to implement in conjunction with (13) because the model 
always fits the observable data exactly. For example, Zwane & van der Heijden (2004) directly 
estimated 7r(0, x) by substituting the Stage-1 function estimates 7r(y, x) in place of TT{y; u{x)) 
in the right-hand side of formula (13) without giving attention to the log-linear equations. 

The saturated model (9) is appealing in its convenience and flexibility, but the resulting es- 
timates can be extremely unstable due to overfitting. We propose a way to stabihze the es- 
timates by averaging the saturated model (9) with the two-parameter model (12), as follows. 
Let nj(n) = [7r{y; u(xj.)} : y 7^ O] denote the conditional multinomial probabiUties implied by 
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model (12) given parameter estimates u = {uq, us). Let = minyj^o'K{y, Xj.). Recalling that 
rii denotes the effective degrees of freedom for Ilj, define a mixing constant aj G (0, 1) as 

oLi = — , 

1 + UiVi 

and define a weighted average nj(-u, a) := (1 — ai)]li{u) + ajllj {i = 1, ric), where the lin- 28o 
ear combination of arrays is evaluated element-wise. Plugging Ili{u, a) into the right-hand side 
of (13) gives an estimate for 7r(0; u{x)). Heuristically, the constant a, is small, putting greater 
weight on the sparse fit Ili{u), when the effective degrees of freedom is not large enough to 
stabilize the smallest element of the saturated fit Ilj. We call this the adjusted saturated model. 

Finally, we mention a non-hierarchical log-linear model with a single list-interaction parame- 235 
ter that was inspired by the Rasch model for educational testing. The quasi-symmetry model is 
derived from log-normal unit-level random effects and takes the form 

log 7r(y; u) = u + uiyi + U2y2 + u^yz + '"s % > (14) 

with moment restrictions that constrain uy, to be greater than zero (Darroch et al., 1993). The ran- 
dom effects interpretation of the quasi-symmetry model has been invoked to model populations 
with heterogeneous capture probabilities without explicitly controlling for observable covariates 290 
X. Supposing that the covariates x do not fully explain the heterogeneity in a population, one 
could apply the quasi-symmetry model as a local log-linear model (modifying the parameters to 
become functions of x as in previous examples) to model any unexplained heterogeneity at each 
level of X. However, prior to trusting the results of any latent-class or random-effects model for 
heterogeneity, we strongly recommend careful consideration of the non-identifiability studies of 295 
Link (2003a) and Mao (2008). 



7. A Simple Application 
71. How Many Birds Species Can be Observed in the U.S. and Canada? 
We estimate the number of bird species using the North American Breeding Bird Survey for 
continental North America north of Mexico (Sauer et al., 2011). Table 2 displays c, the cross- 300 
classification of species observed in the years 2009 - 2011, treating each year as a separate list. 
For example, exactly 581 species were observed in all three years, and 18 species were observed 
only in 2009. 

Table 2. Cross-classification of species observed over three years 

Not in 2011 
13 
10 
18 
Co 

Define a covariate x as the reverse of the rank ordering of the observed species based on the 
total number of times that each species was observed. For example, the species that was observed 305 
most often over the three years has covariate x = 664, as 664 distinct species were observed. The 
obvious interpretation of x is that species with a high value of x are easy to observe. Compared to 



In 2010 



Not in 2010 



In 2011 
In 2009 581 

Not in 2009 10 
In 2009 11 

Not in 2009 21 
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covariates uses previously to model heterogeneity in the detectability of birds, such as wingspan, 
our covariate appears to be a relatively direct proxy for species detectability. 

310 We estimate the conditional probability functions 11 using the np package (Hayfield & Racine, 
2008) in the R statistical software (R Core Team, 2012). The estimated functions H, the result 
of Stage 1, appear in each panel of Figure 1 in a stacked form. These seven curves sum to the 
horizontal line at height 1, labeled "010", reflecting the identity that the conditional multinomial 
capture pattern probabihties must sum to 1 at each x. We subsequently impute 7r(0, x^) by four 

315 different methods, plotting the results as 664 points in each of the panels of Figure 1 . 

The estimates for the independence model (10) were obtained using pseudo-multinomial max- 
imum likelihood. The result is displayed as the top curve in panel (a), labeled "000". For example, 
near x = 1 the distance between the top curve and the horizontal line below it is nearly 0-5. This 
indicates that the independence model imputes nearly 0-5 unobserved units corresponding to 

320 each observed unit. For all units with x > 100, the independence model imputes approximately 
zero unobserved units, as the top curve is nearly coincident with the next-to-top curve. 

The independence model is valid if the event that a species is observed in one year is indepen- 
dent of the event that this species is observed in another year. However, a positive dependence 
between years is plausible if the experience that a bird watcher gains in sighting a certain rare 

325 species in one year increases the probability of similar sightings in years following. Therefore, it 
may be appropriate to include interaction terms. 

The adjusted saturated model in panel (b) was described in Section 6. Applying the saturated 
model (13) directly is not practical because some of the denominator terms approach zero for 
X > 150, causing the formula to become highly unstable or undefined. 

330 Panel (c) in Figure 1 shows the result of applying a stepwise local log-linear model search 
using the Schwarz information criterion. The discontinuities in the imputation curve (labeled 
"000") mark the points at which the choice of local log-linear model changed. One could easily 
smooth across adjacent models to remove the discontinuities, using something like local Unear 
regression. A more-fundamental issue is that the selected models are extremely sparse. For ex- 

335 ample, the short horizontal section of the imputation curve corresponds to the 0-free-parameter 
model, which has only an intercept term, assigning equal weight (1/7) to all capture patterns. 

The small sizes of the models preferred by the Schwarz information criterion reflect the unfor- 
tunate reality that the effective degrees of freedom is small, on the order of 50, providing limited 
information for a multinomial with seven outcomes. A natural reaction is to try over-smoothing 

340 the conditional density estimate ft to increase the degrees of freedom. Panel (d) shows the out- 
come of over-smoothing, with the regression bandwidth (in x) increased to 250 from the cross- 
validation-selected bandwidth of only 27. The estimate of 724 missing species is implausibly 
high, although it is encouraging that the model imputed no missing species for x > 400. Ex- 
cessive over-smoothing clearly defeats the purpose of local model fitting; choosing an optimal 

345 degree of over-smoothing in this context remains an open problem. 

As a point of reference from the existing capture-recapture literature, we fit the quasi- 
symmetry model (14). Following Darroch et al. (1993), we ignore the restriction us > dur- 
ing parameter estimation (nevertheless, we obtained u^, > for the point estimate), resulting in 
an implausibly large estimate cq = 1744. An alternative is to apply the quasi-symmetry model 

350 locally, since regression smoothing and latent covariates may leave significant unexplained het- 
erogeneity at each level of x. Applying the quasi-symmetry model locally is not straightforward 
in this application due to numeric instability in the estimates where x > 150 (here, most of the 
functions in 11 approach 0). However, if we assume that a negligible number of species is missing 
for X > 150 and consider imputed values only for x < 150, we arrive at a more reasonable point 
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(a) Independence model 

1 2.9 units imputed from 664 observed 



(b) Adjusted saturated model 

50.2 units imputed from 664 observed 
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(c) Models selected by information 
criterion 

10.9 units imputed from 664 observed 
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(d) Models selected by information 
criterion, oversmoothed 

724.1 units imputed from 664 observed 
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Fig. 1. The four panels each display the result of a separate 
estimation method. In each panel, the horizontal axis is the 
species rank x. The relative frequencies of the capture pat- 
terns (i.e., "1 11", "001", ...) are plotted as fimctions of x in 
a stacked form. For example, the curve labeled "001" rep- 
resents the sum 7r{(l, 1, l);M(a;)} -|- 7r{(0, 0, 1); w(a;)}. 
The relative frequencies of observable capture patterns 
sum to 1, the horizontal line, labeled "010". Above 
the horizontal line, the imputed values are plotted as 
7r{(0,0,0),.x-i} + 1 (i = 1, ...,664). The seven relative 
frequency curves plotted in panel (d) are much straighter 
than they appear in the other panels due to intentional over- 
smoothing in the Stage-1 smoothing process. 



355 estimate of cq = 85. Broadly applying this model may require mixing with a simpler model as 
in the adjusted saturated model, and we save this for future work. 
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Bootstrapped 90% confidence intervals and standard errors for the number of unobserved 
species are summarized for several models in Table 3. 

Table 3. Bootstrapped estimates of variability o/cq in several models 



The application of capture-recapture methods to three years of data raises the obvious ques- 

360 tion: Why not extend the model to incorporate all available years of data? Indeed, the Breeding 
Bird Survey data goes back to 1965. However, the assumption of a closed population may fail 
over long spans of time, as certain species go extinct, and new species evolve or change their ge- 
ographic region of preference. Effective population size estimation on a 3 -year moving window 
could, in principle, reveal changes in species richness over time. A separate consideration is that 

365 not using data earlier than 2009 allows us to use the previous years of data as a partial validation 
of our method. The data collection format was standardized in 1997; the data from 1997 to 201 1 
reveals 704 distinct species, strongly suggesting that the independence model (a) and Schwarz 
information criterion model (c) are suboptimal in their current form. 

In applying the Horvitz-Thompson estimator (1) or (4), we ignored the point made by Alho 

370 (1990) that the estimator is not consistent if the detection probabihty ■il^{x) approaches 0, even 
if 'i/j{x) were known. In application, 'ip{x) must be estimated, and the necessary condition is 
arguably much more strict: ip{x) must be substantially greater than zero if an estimate #(0, x) 
is to have a useful level of precision. Specifically, the detection probabilities in the left tail of 
the distribution of x in Figure 1 may be too low to estimate accurately. This point deserves 

375 more attention in many capture-recapture studies, including previous studies using Breeding 
Bird Survey data such as Boulinier et al. (1998) and Dorazio & Royle (2003). 

8. Discussion 

The capture-recapture problem is fundamentally a missing data problem. The missing quantity 
of interest is n, the population size, but, perhaps more relevantly, the covariate values Xj. are 

380 missing for i > fic- The nature of this missingness is arguably of the worst possible kind, because 
it is reasonable to suppose that the units which are not observed are not observed precisely 
because they are different from the observed units, not only in the distribution of covariates but 
also in how capture probabilities depend on covariates. Differences between the training data and 
the test data plague every prediction problem, but it is not generally the case that the prediction 

385 data is different from the training data by default. This difference sets apart capture-recapture as 
an exceptionally difficult and risky estimation task. 

Controlling for covariates is effective only if the observable covariates explain much of the 
heterogeneity in capture probabilities, and this condition is not always attainable. Regardless, 
at least attempting to use available covariates is a necessary first step to understanding hetero- 

390 geneity. The necessity of modeling heterogeneity on covariates has been controversial. Much of 
the capture -recapture literature attempts to incorporate heterogeneity effects without using co- 
variates (Pledger & Phillpot, 2008). However, Link showed that the population size n is often 
not identifiable across alternative heterogeneity models (Link, 2003a). Pledger countered that 



Model 

Independence 
Adjusted Saturated 
Schwarz information criterion 
Quazi-symmetry (non-local) 



se(co) 90% Confidence Interval for cq 



3- 8 (4-3, 16) 
48 (42, 181) 

4- 4 (7-8, 16) 
1940 (640, 4900) 
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model misspecification is a relatively minor concern if several different kinds of models all lead 
to similar estimates (Pledger, 2005), a position that was rejected by Link (2003b). 395 

Smooth post-stratification points to several avenues of future work. One key question stems 
from our two-stage approach. The first stage is estimating the functions in IT, and this involves a 
variable selection and/or bandwidth selection problem. The second bias-variance tradeoff occurs 
in the local selection of log-linear models for imputing 7r(0, x), which may emphasize parsimony 
to a greater or lesser degree. Each of these stages has its own bias-variance tradeoff. Currently, 4oo 
the two trade-offs are optimized separately, perhaps by using cross-validation in the the first 
stage and by some information criterion in the second stage. Aesthetically, and perhaps more 
substantively, it is desirable to unify these two modeling problems. 

Local log-linear models enable a high degree of specificity; model selection (not only model 
fitting) may be performed separately for each observed population unit. While the flexibility to 405 
select a separate model for each population unit raises the specter of overfitting, the smoothed 
nature of 11 ensures that the models are highly correlated across units. Therefore, it is unclear 
how to count degrees of freedom globally for the proposed multi-stage estimation, and a penalty 
on the flexibility of the model selection procedure may be appropriate. 
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